Prune idle sessions before starting new ones #701
Conversation
src/ModelContextProtocol.AspNetCore/HttpServerTransportOptions.cs
Co-authored-by: Meir Blachman <[email protected]>
src/ModelContextProtocol.AspNetCore/IdleTrackingBackgroundService.cs
    return;
}

await _idlePruningLock.WaitAsync(cancellationToken);
There's seemingly a ton of work happening under this pruning lock. Would it work to just have the lock be taken long enough to pick and remove a session, releasing the lock before doing more work like tearing down or using its resources? We'd still loop, just outside of the lock rather than inside, under the assumption that pruning attempts will be successful.
The hope is that typical applications rarely have to enter this lock. And when it is entered, we're staying in it extra-long intentionally by awaiting DisposeSessionAsync(), so we don't allow the creation of new sessions while a bunch of sessions are still rooted. That's why I left the following comment:
// We're intentionally waiting for the idle session to be disposed before releasing the _idlePruningLock to
// ensure new sessions are not created faster than they're removed when we're at or above the maximum idle session count.
If we really wanted to, we could use Monitor.Enter and move the call to DisposeSessionAsync outside the lock. I'm not sure what you mean by looping outside the lock, but again, I don't really see the point of optimizing this path too much when we know, thanks to _currentIdleSessionCount, that session disposal is not keeping up with session creation. Waiting on the semaphore is the most resource-efficient thing we can do with these requests until we catch up on disposals.
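For readers following along, here is a rough sketch of the shape being defended here. It is illustrative only: TryRemoveMostIdleSession and _options are hypothetical names, while _idlePruningLock, _currentIdleSessionCount, and DisposeSessionAsync come from this thread.

await _idlePruningLock.WaitAsync(cancellationToken);
try
{
    // Disposal is intentionally awaited while holding the lock so new sessions
    // cannot be created faster than idle ones are removed at or above the limit.
    while (_currentIdleSessionCount >= _options.MaxIdleSessionCount &&
           TryRemoveMostIdleSession(out var idleSession))
    {
        await DisposeSessionAsync(idleSession);
    }
}
finally
{
    _idlePruningLock.Release();
}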
On second thought, I see the issue here. We don't want to limit the parallelism of disposals to only when the background timer gets a hold of the semaphore. Having every request that tries to start a new session while at/over the idle limit wait on at least one session getting disposed should be sufficient to slow down session creation without unnecessarily serializing the disposals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the update to dispose outside of the lock and to acquire the lock inside the outer loop. I reran the benchmarks, and they show about a 10% improvement in session creation when the idle sessions are saturated at the new default 10,000 limit, and over a 100% improvement at the lower 1,000 limit.
There's a consistent increase in variance, which is noticeable run-to-run at higher limits and shows up in the stdev stat at every limit. That makes sense considering we're no longer waiting in an orderly queue to start a new session and instead have to wait for the thread pool to give us priority. It also looks slightly worse at 100,000 and 200,000, but that's now over 10x the default limit, and it's not worse by much.
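To make the change concrete, here is a rough sketch of the revised flow; it is illustrative rather than the actual diff, and TryRemoveMostIdleSession plus the _options wiring are hypothetical names.

// The semaphore is now acquired inside the outer loop and released before the
// disposal await, so disposals run in parallel while each request that hits
// the idle limit still waits on at least one idle session being removed.
while (_currentIdleSessionCount >= _options.MaxIdleSessionCount)
{
    StreamableHttpSession? idleSession;

    await _idlePruningLock.WaitAsync(cancellationToken);
    try
    {
        if (!TryRemoveMostIdleSession(out idleSession))
        {
            break;
        }
    }
    finally
    {
        _idlePruningLock.Release();
    }

    // Disposal happens outside the lock, so teardown is no longer serialized.
    await DisposeSessionAsync(idleSession);
}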
MaxIdleSessionCount = 10,000 (New default)
$ ./wrk -t32 -c256 -d15s http://172.20.240.1:3001/ --timeout=15s -s scripts/mcp.lua
Running 15s test @ http://172.20.240.1:3001/
32 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 242.49ms 145.27ms 1.89s 80.41%
Req/Sec 37.31 23.68 520.00 76.12%
16503 requests in 15.04s, 5.64MB read
Requests/sec: 1097.16
Transfer/sec: 384.15KB
MaxIdleSessionCount = 100,000 (Old default)
$ ./wrk -t32 -c256 -d15s http://172.20.240.1:3001/ --timeout=15s -s scripts/mcp.lua
Running 15s test @ http://172.20.240.1:3001/
32 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.65s 2.81s 13.59s 67.60%
Req/Sec 2.55 5.87 110.00 92.57%
791 requests in 15.09s, 276.81KB read
Requests/sec: 52.41
Transfer/sec: 18.34KB
MaxIdleSessionCount = 1,000 (Lower than default)
$ ./wrk -t32 -c256 -d15s http://172.20.240.1:3001/ --timeout=15s -s scripts/mcp.lua
Running 15s test @ http://172.20.240.1:3001/
32 threads and 256 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 21.68ms 13.58ms 245.56ms 89.03%
Req/Sec 391.13 93.53 2.20k 84.63%
187993 requests in 15.10s, 64.29MB read
Requests/sec: 12449.69
Transfer/sec: 4.26MB
The main thing is disposing pruned sessions outside the lock. This improved session creation performance when at the max idle session count by about 10% with a 10,000 idle-session limit and by about 100% with a 1,000 limit.
This PR replaces #677. The key difference between this PR and the old one is that extra idle sessions are closed immediately before starting new ones instead of waiting for the IdleTrackingBackgroundService to clean up the next time the periodic timer fires. To achieve this, I made the ConcurrentDictionary tracking the stateful StreamableHttpSessions a singleton service that's also responsible for pruning idle sessions. StreamableHttpSession was previously named HttpMcpSession<TTransport>, but it's no longer shared with the SseHandler.
You can look at the description of #677 to see the consequences of creating too many new sessions without first closing and unrooting a corresponding number of idle sessions. The tl;dr is that overallocating could lead to thread pool starvation as hundreds of threads had to wait on the GC to allocate heap space. This thread pool starvation created a vicious cycle, because it prevented the IdleTrackingBackgroundService from unrooting the idle sessions, causing more of them to get promoted and creating more work for the GC.
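To make the layout concrete, here is a minimal sketch of what a singleton session store along these lines could look like. StreamableHttpSessionStore, its members, and the DI registration are hypothetical illustrations; only ConcurrentDictionary and StreamableHttpSession come from the description above.

using System.Collections.Concurrent;

// Hypothetical singleton that owns the dictionary of live sessions and is
// responsible for pruning idle ones before new sessions are started.
public sealed class StreamableHttpSessionStore
{
    private readonly ConcurrentDictionary<string, StreamableHttpSession> _sessions = new();

    public bool TryGetSession(string sessionId, out StreamableHttpSession session)
        => _sessions.TryGetValue(sessionId, out session!);

    public bool TryAddSession(string sessionId, StreamableHttpSession session)
        => _sessions.TryAdd(sessionId, session);

    // Invoked before creating a new session when at or above MaxIdleSessionCount;
    // the pruning details are elided here and sketched further below.
    public ValueTask PruneIdleSessionsAsync(CancellationToken cancellationToken) => default;
}

// Registered once so the request handlers and the IdleTrackingBackgroundService
// share the same dictionary, e.g.:
// builder.Services.AddSingleton<StreamableHttpSessionStore>();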
In order to reduce contention, I reuse the sorted _idleSessionIds list to find the most idle session to remove next. This list only gets repopulated every 5 seconds on a background loop, or when we run out of known idle sessions to close to make room for new ones. This isn't perfect, because sessions may briefly become active while sitting in the _idleSessionIds list without getting re-sorted. However, this is only a problem if the server is at the MaxIdleSessionCount and every idle session that was less recently active at the time of the last sort has already been closed. Considering a sort should happen at least every 5 seconds while sessions are being pruned, I think this is a fair tradeoff to reduce global synchronization on session creation (at least when under the MaxIdleSessionCount) and every time a session becomes idle.
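For illustration, here is a sketch of that selection logic as a member of the hypothetical store above (it relies on System.Linq). The 5-second refresh and the _idleSessionIds name come from this description; _sessions, IsIdle, LastActivityTimestamp, and the inline refresh (rather than a background loop) are assumptions.

private List<string> _idleSessionIds = new();
private DateTimeOffset _lastSort;
private int _nextIdleIndex;

// Picks the least recently active session ID from a periodically refreshed,
// sorted snapshot instead of re-sorting on every session creation. A session
// may have become active again since the last sort, so callers should
// re-check before disposing it.
private bool TryGetNextIdleSessionId(out string? sessionId)
{
    if (_nextIdleIndex >= _idleSessionIds.Count ||
        DateTimeOffset.UtcNow - _lastSort > TimeSpan.FromSeconds(5))
    {
        _idleSessionIds = _sessions
            .Where(pair => pair.Value.IsIdle)
            .OrderBy(pair => pair.Value.LastActivityTimestamp)
            .Select(pair => pair.Key)
            .ToList();
        _nextIdleIndex = 0;
        _lastSort = DateTimeOffset.UtcNow;
    }

    if (_nextIdleIndex < _idleSessionIds.Count)
    {
        sessionId = _idleSessionIds[_nextIdleIndex++];
        return true;
    }

    sessionId = null;
    return false;
}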
In my testing on a 16-core/64 GB VM, a 100,000 idle session limit (the old default) caused the server process to consume 2-3 GB of memory according to Task Manager and limited the new session creation rate to about 60 sessions/second after reaching the MaxIdleSessionCount. At a 10,000 idle session limit (the new default), process memory usage dropped to about 300 MB and the session creation rate increased to about 900 sessions/second. And at the even lower 1,000 idle session limit, process memory usage dropped further to about 180 MB and the session creation rate increased again to about 5,000 sessions/second. All of these numbers are stable over repeated runs after having reached the MaxIdleSessionCount.
MaxIdleSessionCount = 10,000 (New default)
MaxIdleSessionCount = 100,000 (Old default)
MaxIdleSessionCount = 1,000 (Lower than default)
Single Session Tool Call
The MaxIdleSessionCount has no apparent effect on this test, and I wouldn't expect it to, since we still look up existing sessions the same way we did previously.
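For context, configuring the limit discussed in this PR looks roughly like the following in an ASP.NET Core host. This is a sketch based on the ModelContextProtocol.AspNetCore setup pattern; confirm the exact option surface against HttpServerTransportOptions.

var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddMcpServer()
    .WithHttpTransport(options =>
    {
        // Lower limits trade burst capacity for memory and faster pruning,
        // per the benchmarks above. 10,000 is the new default in this PR.
        options.MaxIdleSessionCount = 10_000;
    });

var app = builder.Build();

app.MapMcp();

app.Run();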